Introduction


Background 


We are seeing more processors rather than faster ones. 

The challenge now is to find ways of using multiple cores 
effectively to improve the performance of a single program. 

NB: No interest here in efficiency 




ffitch 


ParCS 



Historical Note 


I first stated this in the early 1970s, and at intervals since, but 
now the need is much more imminent! 

The Hardware Imperative 


Robert S. Barton, one of the greats of early computing, said there
is a technological imperative: what hardware requires is a
forcing term on software.

Attempts at parallelism at that time did not succeed, but we can
learn from the experience.

A Brief Biased History of Parallelism 


Forty years ago I was proposing a parallel functional machine,
and thirty years ago we built the Bath Concurrent LISP Machine,
a cluster of six M68000 processors with each processor having
three shared memory windows with one other.

Twenty years ago we built a LISP-based Concurrent
Object-Oriented system.

Concurrent Software 


We based our work on the premise that users cannot be
expected (or trusted) to modify their thinking for parallel
execution, and the responsibility needs to be taken by the
software translation system that converts the program or
specification into an executable form.

Compiler analysis can be extended to inform the structure;
this is described variously in the PhD thesis of Marti (1980) and
in papers by me on computer algebra.

The Two Critical Points 


A: Two entities can be run at the same time if they do not 
reference/modify shared data 

B: Two entities should be run at the same time if the overhead 
is less than the gain 

Ab Initio Parallelism? 


Should we just start coding again? I say not, as there is too
much already committed.

Two attempts, however, are worth mentioning:

• Csound in real-time using Transputers
• Midas streamed DSP network

Both are finer-grained than what I am advocating.

Towards a Parallel Csound 


Csound has been in existence and development for 25 years. It 
provides instruments that are played following a score. 

The instruments are activated, performed and deactivated 
using a control cycle (running at a control rate). Instruments are 
performed in a defined order, and so interaction between 
instruments has defined behaviour. 

until end of events do
    deal with notes ending
    sort new events onto instance list
    for each instrument in instance list
        calculate instrument

Towards a Parallel Csound (b) 


Making this parallel could mean making the loop parallel, as long
as there is no interaction, so:

• Following Marti, we can use code analysis techniques.
• Only global variables matter.
• For each instrument, determine the sets of globals that are
  read, written, or both.
• Use this to control the loop.
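
The analysis steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code used in Csound: the names `conflicts` and `ordering_arcs`, the tuple layout, and the three-instrument orchestra are all assumptions made for the example. Two instruments are treated as interacting when they touch the same global and at least one of them writes it.

```python
from typing import List, Set, Tuple

# Hypothetical per-instrument summary: (globals read, globals written).
Access = Tuple[Set[str], Set[str]]


def conflicts(a: Access, b: Access) -> bool:
    """True if a and b share a global and at least one of them writes it."""
    a_reads, a_writes = a
    b_reads, b_writes = b
    return bool(a_writes & (b_reads | b_writes)) or bool(b_writes & a_reads)


def ordering_arcs(instances: List[Tuple[int, Access]]) -> List[Tuple[int, int]]:
    """Return (earlier, later) pairs that must keep their serial order."""
    arcs = []
    for i, (first, first_access) in enumerate(instances):
        for later, later_access in instances[i + 1:]:
            if conflicts(first_access, later_access):
                arcs.append((first, later))
    return arcs


# Illustrative orchestra: instr 2 writes the global gk, instr 3 reads it.
instances = [
    (1, (set(), set())),
    (2, (set(), {"gk"})),
    (3, ({"gk"}, set())),
]
arcs = ordering_arcs(instances)
print(arcs)
```

Instruments with no arc between them are candidates for running at the same time; two instruments that only read the same global do not conflict.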

Towards a Parallel Csound (c) 


Special case: most instruments add into the output bus, but this 
is not an operation that needs ordering (subject to rounding 
errors), although it may need a mutex or spin-lock. The 
language processing can insert any necessary protections in 
these cases. 


There are globals other than variables, but the idea is the same.
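
The output-bus special case might look something like the sketch below. It assumes Python's `threading.Lock` as a stand-in for the mutex or spin-lock mentioned above; `output_bus`, `add_to_bus`, and the value of `KSMPS` are hypothetical names and numbers chosen for illustration.

```python
import threading
from typing import List

KSMPS = 4  # samples per control cycle (illustrative value)
output_bus = [0.0] * KSMPS
bus_lock = threading.Lock()


def add_to_bus(samples: List[float]) -> None:
    """Mix one instrument's block into the shared output bus."""
    with bus_lock:  # the protection the language processing would insert
        for i, s in enumerate(samples):
            output_bus[i] += s


# Three hypothetical instruments, each contributing a constant block.
blocks = [[0.25] * KSMPS, [0.5] * KSMPS, [1.0] * KSMPS]
threads = [threading.Thread(target=add_to_bus, args=(b,)) for b in blocks]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(output_bus)
```

The order in which the three blocks are added is not fixed, which is acceptable here because addition into the bus does not need ordering beyond rounding effects.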

Design

Build a DAG of ordering dependency, where an arc means that a
node must be evaluated before its descendants.

until end of events do
    deal with notes ending
    add new events and reconstruct the DAG
    until DAG empty
        foreach processor
            evaluate a root from DAG
        wait until all processes finish
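
One control cycle of the loop above could be sketched as below. This is only an illustration under stated assumptions: a Python thread pool stands in for the processors, and `run_cycle`, `deps`, and `perform` are hypothetical names. The DAG is given as a map from each node to the set of nodes that must finish before it.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Set


def run_cycle(deps: Dict[int, Set[int]], perform: Callable[[int], None],
              processors: int) -> List[List[int]]:
    """Consume a DAG in rounds; deps maps a node to its unfinished predecessors."""
    remaining = {node: set(before) for node, before in deps.items()}
    rounds = []
    with ThreadPoolExecutor(max_workers=processors) as pool:
        while remaining:                         # until DAG empty
            # Roots are nodes with nothing left to wait for.
            roots = sorted(n for n, before in remaining.items() if not before)
            if not roots:
                raise ValueError("dependency cycle: not a DAG")
            batch = roots[:processors]           # one root per processor
            list(pool.map(perform, batch))       # wait until all have finished
            for node in batch:
                del remaining[node]
            for before in remaining.values():
                before.difference_update(batch)
            rounds.append(batch)
    return rounds


performed: List[int] = []
rounds = run_cycle({1: set(), 2: set(), 3: {2}}, performed.append, processors=2)
print(rounds)
```

Waiting for a whole round to finish before taking more roots follows the pseudocode literally; it is simple, but a processor that finishes early sits idle until the round ends.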

Compiler Example

Using the new parser the information is “easily” gathered and 
the bus-locks inserted. 


instr 1
  a1 oscil p4, p5, 1
     out   a1
endin
instr 2
  gk oscil p4, p5, 1
endin
instr 3
  a1 oscil gk, p5, 1
     out   a1
endin



Maintaining the DAG 


This is a major problem. The DAG is consumed on each cycle, but
adding and losing instances means the DAG must be remade, not
just copied. The current representation and algorithm are
the result of much experimentation and could probably be
improved.

Locking and Barriers 


We use the POSIX pthreads library. 

• One master thread does analysis and DAG construction.
• A barrier at the start of each control cycle.
• Each worker gets a task from the DAG, with a mutex.
• At the end of an instrument-cycle the DAG is modified.
• When there is no work, proceed to the end barrier.
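
The thread structure in the list above might be arranged as in this sketch. It is not the pthreads implementation: Python's `threading.Barrier` and `threading.Lock` stand in for the POSIX barrier and mutex, and every name and constant here (`waiting`, `log`, `NUM_WORKERS`, `NUM_CYCLES`, the three-node DAG) is an assumption made for the example.

```python
import threading
import time
from typing import Dict, List, Set, Tuple

NUM_WORKERS = 2   # illustrative values
NUM_CYCLES = 3

dag_lock = threading.Lock()                         # the mutex guarding the DAG
start_barrier = threading.Barrier(NUM_WORKERS + 1)  # workers plus the master
end_barrier = threading.Barrier(NUM_WORKERS + 1)
waiting: Dict[int, Set[int]] = {}                   # node -> unfinished predecessors
log: List[Tuple[int, int]] = []                     # (cycle, instrument) as performed


def worker() -> None:
    for cycle in range(NUM_CYCLES):
        start_barrier.wait()                        # barrier at the start of the cycle
        while True:
            with dag_lock:                          # get a task from the DAG
                if not waiting:
                    break                           # no work left: go to the end barrier
                ready = [n for n, before in waiting.items() if not before]
                node = min(ready) if ready else None
                if node is not None:
                    del waiting[node]
            if node is None:
                time.sleep(0)                       # nothing ready yet: yield and retry
                continue
            log.append((cycle, node))               # stands in for performing the instrument
            with dag_lock:                          # end of instrument: modify the DAG
                for before in waiting.values():
                    before.discard(node)
        end_barrier.wait()


def master() -> None:
    for _ in range(NUM_CYCLES):
        # Analysis and DAG construction: instrument 3 must follow instrument 2.
        waiting.update({1: set(), 2: set(), 3: {2}})
        start_barrier.wait()
        end_barrier.wait()


workers = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for w in workers:
    w.start()
master()
for w in workers:
    w.join()
print(log)
```

The master only touches the DAG while the workers are parked at the start barrier, so construction itself needs no lock in this arrangement.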

Load Balancing

We would like each task to involve equal computation and to be
sufficiently large.

This is not always true, and currently we ignore this problem.
Code exists to collect instances together.

Load Data

We are collecting data on average instruction count for
opcodes, using valgrind.

We calculate three counts: initialisation, per k-cycle, and per sample.
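
One way the three counts could be combined into a single load estimate is sketched below. The linear formula is an assumption for illustration, not something stated in the talk, and `OpcodeCost` and `estimated_instructions` are hypothetical names. The sample figures are the oscil.kk row from the opcode cost table.

```python
from dataclasses import dataclass


@dataclass
class OpcodeCost:
    init: float      # instructions at initialisation
    control: float   # instructions per k-cycle
    audio: float     # instructions per sample


def estimated_instructions(cost: OpcodeCost, k_cycles: int, ksmps: int) -> float:
    """Linear estimate (an assumption): init once, control per cycle, audio per sample."""
    return cost.init + k_cycles * (cost.control + ksmps * cost.audio)


oscil_kk = OpcodeCost(init=69, control=47, audio=12)
estimate = estimated_instructions(oscil_kk, k_cycles=100, ksmps=10)
print(estimate)
```

Summing such estimates over the opcodes of an instrument would give a rough weight per task, which is the kind of figure load balancing needs.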

Costs of a few opcodes 


Opcode       init      Audio       Control
table.a      93        23.063      43.998
table.k      93        0           45
butterlp     9         29.005      45.478
butterhi     19        30.000      35
butterbp     20        30          71
bilbar       371.5     1856.028    86
ags          497       917.921     79475.155
oscil.kk     69        12          47
oscili.kk    69        21          49
reverb       6963.5    77          158

Current State

Implemented by Chris Wilson, revised by John ffitch. Tested on
Linux (and OS X). Requires the new parser but is available on
SourceForge as a branch. The number of threads can be controlled.

Some features are still missing, such as zak and buses.

Linux Quadcore Results 


                       Time
Sound        ksmps     1         2         3         4         5
Xanadu       1         31.202    39.291    42.318    43.043    48.304
Xanadu       10        18.836    19.901    20.289    21.386    22.485
Xanadu       100       16.023    17.413    16.999    16.545    15.884
Xanadu       300       17.159    16.137    15.141    15.723    14.905
Xanadu       900       16.004    15.099    13.778    14.364    14.167
CloudStrata  1         173.757   191.421   211.295   214.516   261.238
CloudStrata  10        89.406    80.998    94.023    110.170   98.187
CloudStrata  100       85.966    86.114    81.909    83.258    85.631
CloudStrata  300       87.153    76.045    79.353    78.399    74.684
CloudStrata  900       82.612    76.434    64.368    76.217    74.747
trapped      1         20.931    63.492    81.654    107.982   139.334
trapped      10        3.348     7.724     9.500     12.165    14.937
trapped      100       1.388     1.810     1.928     2.167     2.612
trapped      300       1.319     1.181     1.205     1.386     1.403
trapped      900       1.236     1.025     1.085     1.091     1.112

Performance

As the control rate decreases, corresponding to an increase in 
ksmps, the potential gain increases. This suggests that the 
current system is using too small a granularity and the 
collecting of instruments into larger groups will give a 
performance gain. 

The performance figures are perhaps a little disappointing, but 
they do show that it is possible to get speed improvements, and 
more work on the load balance could be useful. 
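
To put a number on the gain, the times from the quad-core results can be turned into ratios. The sketch below uses the ksmps = 900 rows and assumes the first column is the single-threaded run; `best_speedup` is a hypothetical helper name.

```python
from typing import Dict, List

# Times copied from the quad-core results for ksmps = 900 (columns 1 to 5).
times: Dict[str, List[float]] = {
    "Xanadu": [16.004, 15.099, 13.778, 14.364, 14.167],
    "CloudStrata": [82.612, 76.434, 64.368, 76.217, 74.747],
    "trapped": [1.236, 1.025, 1.085, 1.091, 1.112],
}


def best_speedup(row: List[float]) -> float:
    """Ratio of the first column's time to the fastest time in the row."""
    return row[0] / min(row)


speedups = {name: best_speedup(row) for name, row in times.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

On these figures the best ratios fall between roughly 1.16 and 1.28, which matches the description of the results as modest but real.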

Conclusions

A system for parallel execution of Csound has been presented
which works at the granularity of the instrument and is based on
thirty-year-old technology.

I believe that the level of granularity is the correct one, and that
with more attention to the DAG construction and load balancing it
offers real gains for many users. It does not require specialist
hardware, and can make use of current and projected
commodity systems.